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Abstract 

The problem of proximity search in biological databases is addressed. We study vector 
transformations and conduct the application of DFT (Discrete Fourier Transformation) and 
DWT (Discrete Wavelet Transformation , Haar) dimensionality reduction techniques for DNA 
sequence proximity search to reduce the search time of range queries. Our empirical results on a 
number of Prokaryote and Eukaryote DNA contig databases demonstrate up to 50-fold filtration 
ratio of the search space, up to 13 times faster filtration. The proposed transformation techniques 
may easily be integrated as a preprocessing phase on top of the current existing similarity search 
heuristics such as BLAST[1], PattenHunter[ll] , Fast A [17], QUASAR[4j and to efficiently prune 
non-relevant sequences. We study the precision of applying dimensionality r^eduction techniques 
for faster and more efficient range query searches, and discuss the imposed trade-offs. 

1 Introduction 

Discovering the structure, function and evolutionary relationship of genes are the main goals of 
genome sequencing research, where comparative analysis of homologous sequences is a crucial part 
in the study of gene function, known as genomics. The behavior resemblance of two DNA sequences 
of two different organelles or species to the same external exposure, may be used to infer functional 
or structural similarities, or mutual inclusion in the same pathway or biological mechanism. Some of 
the vast applications of proximity search include discovering the nature and functionality of human 
genome, phylogenetic analysis, drug discovery, keyword search in databases, or user pattern analysis 
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in the context of network security^ and many more. For instance, approximate sequence analysis 
has assisted the detection of certain strains of the Escherichia coli(E.coli) bacteria responsible for 
infant diarrhea and gastroenteritis. The researchers at the University of Chicago's Howard Hughes 
Medical Institute [16] discovered a protein molecule able to transmit a genetic trait without DNA 
or RNA in yeasty which is able to string itself together into a long fiber, much hke those found 
in the brain in mad cow and human Creutzfeldt- Jakob diseases. In general, some of the typical 
applications of sequence proximity search include[3]: 

• Identification of highly conserved residues/motifs which are likely to correspond to essential 
sites for the structure or function of the sequence. 

• Phylogenetic analysis which refies on neighbor sequence search, at the protein or DNA level, 
to predict mutations from which it is possible to retrace evolutionary relationships among dif- 
ferent genetic sequences. Similarly, phylogenetic trees provide the information to reconstruct 
the history of species and gene families. 

The proximity search seeks the sequences close enough to a given query sequence either through 
direct alignment [15, 18] or using other heuristics [1, 11, 17, 19]. The alignment of biological sequences 
(pairwise or multiple alignment) is the operation to place nucleotide or amino acid residues in 
columns inferring the closest common ancestral relationships. This is achieved by introducing gaps 
(with predefined costs), to represent insertions or deletions (so called indels), into sequences. Hence, 
an alignment is a hypothetical model of mutations on the residue level through edit operations 
(Replacement, Insertions, Deletions). The best alignment usualty refers to the one demonstrating 
the most likely evolutionary scenario. Let Si,S2 G to be finite ordered DNA sequences of 
characters(6a5e5) taken from the alphabet set E, where E = {A, C, G, T}. Each pair of characters 
from T, are assigned a replacement (substitution) score/cost and the substitution matrices [6, 8, 20] 
providing such information are built based on the structure similarity and replacement fikefihood 
of the residues. For instance, at the DNA level, probabilities of substitution vary according to the 
nature of the base pairs. Notably, transitions (substitutions between two purines, A and G, or 
two pyrimidines, C and T) are generally more frequent than transversions (substitutions between 
a purine and a pyrimidine). Hence, the optimal alignment of sequences 5*1 and 5*2, named S^^ and 

is achieved by applying the minimum number of edit operations to transform into 5*2, caUed 
Edit Distance, or ED{Si, 82)- The pairwise optimal alignment is based on dynamic programming 
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ensuring the maximal score (or the minimal cost) and requires 0{pq)-time, and O (jog') -space, using 
dynamic programing [15, 18] algorithm. An example of the described procedure is depicted in the 
following example: 

51 AACTCGAGACCC 

52 ATCCGAGAGGTCCC 

^ 

S[ AACTCGAGA CCC 

R D III 

S!^ ATC-CGAGAGGTCCC 
where R, D, and I correspond to Replacement^ Deletion and Insertion respectively. Computing 
the optimal alignment of n sequences, each of length I requires o(2'^Z^)-time and o(Z^)-space. Unfor- 
tunately, such an algorithm is neither practical nor scalable. Following is a summary of some of the 

problems encountered in the sequence proximity search within the context of biological sequences: 

• The quadratic computational complexity of the optimal ahgnment is so high, making it im- 
practical. 

• Due to the limitations on the current knowledge of mutations and their corresponding proba- 
bilities, only approximate searches and heuristics[l, 11, 17, 19] have been practically applied 
for comparison of sequences. 

• Scalability is one of the most important issues to be addressed. Neither the Dynamic pro- 
gramming algorithms nor the heuristics may practically be appUed to a large number of 
sequences (across species), each of which might be composed of biUions of residues. 

The rest of the paper is organized as foUows: Section 2, discusses the background and related 
work, followed by the motivation and terminology in section 3. Section 4, studies the proposed 
transformation techniques and their integration. Section 5 demonstrates a concise empirical per- 
formance analysis and the simulation results. Finally, section 6 concludes the work. 

2 Background, Related Work 

In a typical application of range query, given a protein or DNA query sequence Q and range r, 
it is compared with all the sequences in the database, in search for sequences which are at most 
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r edit operations far from query Q, However as mentioned before, because of the quadratic time 
involved, the dynamic programming[18, 15] algorithms may not be directly or practically applied 
for this purpose. Several heuristics [1, 11, 17, 4] have been proposed to speed up the homology 
search procedure, which are not efficient for range query over large datasets. These heuristics need 
to inspect the entire database while only a very small part of it might actually be of interest. The 
rest of this section highlights the recent research addressing the similar problem. 

Multi-Resolution index Structure(MRS)[10] is a technique based on Haar wavelet, designed to 
speed-up range queries. It uses a sliding window of size \w\, moving over the query sequence and 
for each possible location, extracts the first and second Haar wavelet coefficients of the |if;|-sized 
subsequences/ windows. Hence each window is mapped to a point in 3?^. Furthermore, every c trail 
of windows is represented with a single Minimum Bounding Rectangle ( MB R). The trail of MBRs 
are subsequently processed at different resolution levels, based on different values for \w\. Given a 
range query (Q,r), the query is first divided into the maximum 2*-sized postfix segments, with the 
assumption that the query sequence is originally of length 2^ for some j > and furthermore, each 
segment is searched within the respective resolution level. However, i) the lower bound provided 
by their Frequency Distance (FD) function is not tight enough for score and ED estimations and 
using more coefficients, MRS does not guarantee a better precision, contrary to what is expected 
from a distance preserving transformation, ii) the focus of the work is on the performance of the 
index structure and does not analyze the filtration efficiency of using wavelet transformation on 
biological data. 

Chavez and Navarro [5] translate the problem of approximate string search into a range query 
or proximity search in a metric space. The technique is based on picking k pivots randomly, and 
mapping each sequence with a fc-dimensional vector(only keeping rnin and max distances), further 
using triangle inequality to prune non-relevant sequences using Suffix Tree [2] as index structure. 
No empirical analysis is conducted to evaluate this approach on real biological data. SST[7] uses 
overlapping sliding windows of size w over the database sequences and maps them into a 3?"^^- 
dimensional frequency vectors. After that, SST uses fc-means clustering algorithm, hierarchically 
clustering database sequences. Given a query it is first divided into non-overlapping windows, 
pruning the database windows which are farther from the given query range, and finally studying 
the effect of window size on search time, and error rate of input data on true positive/negative 
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rates. Finally, authors in [21] provide a concise study of DFT/DWT transformations, but only in 
the context of time-series databases. 

In this work, we study the effectiveness of different edit distances, and the application /integration 
of DFT/DWT for the purpose of sequence proximity search specially in the context of biological 
databases. 

3 Motivation, Terminology 

Definition 1 Let T = Ti, . . . , T/v be a sequence database over the alphabet Let Tij denote a 
subsequence ofTi starting at index for 1 < i < N and 0 < j < IT^I^ where < \Ti\— j. Given 

a query pattern Q E Tl'*' and a range r, a Range query is the problem of finding the Result set 
R(Q, r) as the set of all the subsequences Tij in T such that ED{Q, Tij) < r, where ED corresponds 
to the Edit Distance. 

One way to solve the Range query problem is as follows: Given a query pattern Q, compare 
all sequences stored in the database against Q using ED^ either through direct application of dy- 
namic programming[15, 18] or other popular heuristics[l, 11, 17, 4], and determine the set R{Q,r). 
Although this approach is correct, it is not practical/scalable for two reasons. First, sequence 
databases may involve a large number of very large sequences (e.g. Chr 22 as the smallest human 
chromosome[14] that consists of approximately 35 miUion base pairs) resulting in severe perfor- 
mance penalty. Secondly, the prohibitive computational cost of alignment or even heuristic-based 
sequence comparison makes it impractical, specially when \R{Q,r)\ is very small, compared to the 
total number of subsequences in T. 

A solution could be mapping the sequence similarity (using Edit Distance(ED) for Red{Q,^)) 
problem into a numerical/ vector difference (using Frequency Distance(FD) for Rfd{Q^^)) problem 
to benefit from much more time/space-efiicient numerical methods in the literature. One way 
is to use a mathematical transformation to map the sequence domain{o{ sequences Si) into a 
vector/frequency domain{of frequency vectors f{Si))^ and use an appropriate frequency distance 
function to estimate the edit distances of the sequence domain. If the right transformation is 
applied, then Parseval Theorem implies that Frequency Distance is less than or equal to Edit 
Distance^ or for short FD{f{Si),f(Sj)) < ED{Si, Sj) (Distance preserving transformation), with 



5 



the equality holding when all the transformation coefficients are used in the frequency distance 
calculations. This property is the main driving force behind using transformation. Furthermore: 

• The calculation of distance in the frequency domain (FD), is much more time /space- efficient 
compared to the calculation of the distance in the original sequence domain(^Z^). 

• Range queries are much more efficiently evaluated in the frequency domain. For instance, 
suppose having a query pattern frequency vector f{Q), range r and a set of frequency vectors 
f{Si),...,f{Sn), then all the frequency vectors(and in turn sequences) f{Si<i<rt)^ where 
FD(f{Q),f(Si)) > r may be pruned from the answer set without the need to investigate 
further (to calculate the original ED), at a very low cost. This would dramatically reduce 
i) the computational cost[10] and, ii) the required amount of search space RF£)(f{Q),r), for 
a given range query where Red{Q')'^) ^ RpDifiQ)-)'^)' However, a very important 
requirement is to guarantee that the ratio ^^\R^^^{Q'^r)\^ ^ 1? to incur any false negatives, 
while being as small as possible for the better filtration ratio. 

Following definitions introduce the steps in transforming the original do main (set of seque nces) 
to frequency domain(set of vectors): 

Definition 2 Let S = be a sequence over the alphabet Efc = {cui, . . . , Oifc}^ then the 

frequency vector of Sj called f(S) is defined as: f{S) = [/i, • • • , where fi(> 0) corresponds to 
the occurrence frequency of ai in Sj and Yli^i fi = 1^1 = 'f^- Similarly j is defined to be a \T,\ x 1 
vector where the entry corresponding to cxi E Tik? for i < k, is 1 and all other entries filled with 0, 

or in other words: 



fi 

f2 
fk 



, where Vi < fc, fi 



1 ai 

0 otherwise. 



For instance, let S = AGGTTGCAATTA be a sequence over = {A,C,G',T}, then f{S) 
[4, 1, 3, 4], and = [1, 0, 0, 0], = [0, 0, 1, 0], . . ., ^f^l = ^12 = [1. 0. 0. 0]. respectively. 
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Definition 3 Let S = 5i, . . . , 5^ he a sequence from the alphabet T^k? Frequency- Quantization of S, 
called = [5^, . . . , 5^] is a sequence of x 1 vectors 5% one for each ordered base of the sequence, 
where = ^g., for 1 < i < n. 

Let S be the same sequence as given above, then 

100000011001 
000000100000 
011001000000 
000110000110 

Following definitions, capture two of the famous distance preserving transformations which we 
deployed in our study. 

Definition 4 The k^^ -level Haar Wavelet Transformation(DWT)[10] of a frequency- quantized se- 
quence S. vj]^{S), for 0 < k < log2Tiy is defined as vjj^{S) = [^^^,05 • • • 5 '^fc — ]^ where v^^i = 
[c^k,i,Pk,i]^ for 

I f{ci) k = 0 

I OLk-i^2i + 0 < fc < log2n, 



0 k = 0 

otk-i,2i - 0 < fc < log2n, 



where for k = log2n: aiog^ri,o = f{S[0 : n - 1]) and piog^n^o = f(S[0 : f - 1]) - : n - 1]) 

represent the first and second Haar wavelet coefficientSj respectively. 

For instance, for the same S as given above, the S^'^^-level DWT of S: vj^{AGGTTGC AATTA) = 
{q;3,05 /?3,o} = {[4, 1, 3, 4], [—2, —1, 3, 0]}, represents the set of first and second wavelet coefficients. 

Definition 5 The n-point Discrete Fourier Transformation (D FT) of a sequence S = [St]^ for 
t=0,. . . ,n-l is defined to be a sequence X of n complex numbers Xf of \Ti\ x 1 vectors, for f = 
0, . . . , n — 1^ and is given by 

Xf^^ T.7=o Ste^^, f=0,l,..., n-1, 



where j = is the imaginary unit. The original sequence S can be restored by the inverse 

transform: 

St = ^ TJfZl Xfe^, t = 0, 1, n-1, 

where Xf is a complex number and its real and imaginary parts are x 1 vectors. 

Let 5" = ACCT, the first and second DFT coefficients of 6" are calculated as: Xo(5") = 
1,0, ^] and Xi(50 = {[i ^,0,0], [0, ^,0, i]}, respectively. 

Meanwhile, There is one question to be answered: What would be the proper FD distance to 
deploy in the frequency domain to provide a better /tighter approximattion of the Edit Distance of 
the original space? 

Within the context of frequency transformations in multi-dimensional indexing, Lp-norm(p > 0) 
distance measures are usually the popular choice for frequency distance function. However, this 
choice is very much application-dependent. Hence, we decided to run exhaustive experiments on 
DNA sequences using Li-norm, L2-norm, {FDi, FD2)[10\ distance functions, to analyze the accu- 
racy level of each of them. We should use the FD which demonstrates a tighter bound estimate of 
the ED in the original domain, or in other words ''^more precisely^^ reflecting the similarity /distance 
across the sequences. 

Figure 1, shows the average distance comparison resulting from running an All-Pair- All exhaus- 
tive frequency sequence comparison [18] on the contig sequences of Alu, Mitochondria, Escherichia 
coli{E .coli) , Yeast and Drosophila[14]. All the contigs are first transformed into frequency do- 
main using wavelet transformation and then the distance across the extracted vectors for the first 

and second coefficients are calculated using the mentioned distance functions. Li-norm empiri- 
cally demonstrates a tighter bound(higher distance), compared with {FDi,FD2)[10] and L2-norm 
on all the databases. Higher the FD^ a better estimate of ED is achieved, while FD < ED by 
Parseval Theorem. For any two frequency vectors X^Y: Li-norm as the desired frequency dis- 
tance is defined as L\{X,Y) — J2j \^j ~ Vjl^ ^i^d may be expressed as the minimum number of 
increment / decrement (dzl) operations needed(on the entries) to transform vector X into vector Y, 
which is a lower bound on the original domain's ED distance. Li-norm is also expected to have a 
lower computational cost compared with the others. 
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Figure 1: Average distance comparison resulted from contig DNA databases of Alu, Mitochondria, 
Escherichia coli{E .coli) , Yeast and Drosophila[14], using Li-norm, L2-norm and (FDi, FD2)[10]. 

4 Transformation procedure 

We perform the process in two different stages (depicted in Figure 2), offline and online, as follows:. 
• Preprocessing phase(Offline): Given the database of sequences T = {Ti,T2, . . . ^T^}, \/ Ti G 

1. Let Ts be the sequence with the minimum length, say m. Choose the window size to 
yield an optimal total number of blocks on that sequence as j = 6{m/logY:N)[12, 13], for 
database size N. This optimal j has been suggested for pattern partitioning, however 
the length of the pattern might not be known in advance, so we restrict the maximum 
size of the pattern by the size of the minimum sequence in the database, to be able to 
benefit from the optimal partitioning. 

2. Slide the window on each of the original DNA sequences T^, and extract the correspond- 
ing |it;|-sized blocks. Partition each Ti on the starting positions of 0, , . . . into 
a total of ^"^"^7^7^"^"^ blocks. Let bij denote the block extracted from sequence T<j, at 

l-~2"J 

position j, where 1 < i < n, and 0 < ^ < \Ti\ — \w\. 

3. Perform Frequency- Quantization(Def. 3) on each of the subsequences /blocks bij, ex- 
tracting bfj, 
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4. Use the desired DFT/DWT transformations on each of the Frequency- Quantized blocks 
bfj, and calculate the corresponding transformed vector X{bfj) or vj{bfj) coefficients in 
the frequency domain. 

5. Extract and store only a few coefficients (at most three) to represent the original subse- 
quence/block. For the case of DFT, we keep the highest energy-concentrated-coefficients 
as, first, last and the second[21]. For the DWT^ we keep the first and second coefficients, 
in which the energy of the sequence is expected to be mostly concentrated. 

6. Build an offiine index structure as follows: For each of the sequences in the database, 
Ti, keep a list of block vectors contained in that sequence. We keep only an index for 
the location of the extracted block, and the corresponding fixed-size frequency vector (s), 
which are at most two for DWT, and three for DFT. The higher precision might be 
achieved by choosing more coefficients, as needed, which is a built-in feature in our im- 
plementation. 

• Query processing (Online): Given a query pattern Q G T,* and range r: 

1. Slide the | it; | -sized window on the pattern sequence partitioning it into non-overlapping 
segments of length \w\, for a total of = LI Ql/ 1^1 J partitions. Let Qi denote the parti- 
tion of Q starting at index Z, where 1 < I < \Q\ — \w\ +1. 

2. For each of the extracted partitions Qi: 

— Perform Frequency- Quantization (Def, 3) on Qi partition/block, and calculate Qf, 

— Use DFT/DWT transformations on the Frequency- Quantized blocks Qf, and ex- 
tract the corresponding X{Qf) or vj{Qf ) coefficient vectors, 
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Figure 2: The transformation procedure. 



— Search each of the coefficient block vectors, bfj, stored in the offline index, and 
prune all the subsequences /blocks which, 

* DFT: FD{X{Qf),X{bl^)) > f, or 

* DWT: FDi^iQf),zu{b(^)) > f. 

3. Calculate ED only for the candidate result set (those not pruned), to find the subse- 
quences Si, where ED(Q,Si) < r. 



5 Performance Analysis 
5.1 Implementation 

We compared the application of the transformation techniques with two other approaches. The 
first one so called String is the g-gram indexing method used by QUASAR[4], and the second one 
so called Vector, is a more space efficient (trading for time) version of String for the inverted table 
index structure that is used in the approach. 

The String^ uses a block addressing scheme as follows: Each of the contigs of the database, 
and the pattern sequence, are partitioned into blocks of fixed size |w|(as calculated in the previous 
section), for a total of B blocks in database, and a counter C^,. is associated with each block hi of the 
database, respectively. For the g-gram of size g, an index structure of size lEl"^ is maintained(g = 3). 
Each entry corresponds to a unique g-gram, followed by a list of blocks containing that particular 
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g-gram. All the g-grams of the blocks of a pattern are inspected(consecutive g-grams of the blocks 
overlap inq—1 bases), and the counter C5. is incremented whenever a search for that g-gram reports 
^' existing^' in the 6^. After processing all the blocks and the corresponding g-grams of the pattern, 
each counter C7j>. (where 0 < ^ < B ) indicates how many unique g-grams (ignoring the positional 
information) from Q are contained in that block 6^, of the database. Thereafter, all the block 
counters are stored in an array of size \T\/B, and the blocks 6^, whose counters C^. contain (share) 
less than max(\Q\, |6^|) + 1 — (r + 1) • |g| of g-grams with the pattern, are pruned from the candidate 
set[9, 4]. We used a uniform blocking method across all different methods. Note that String q- 
gram method is an approximate method, in contrary with dynamic programming algorithm, and 
potentially suffers from false positives. 

The second approach called Vector, is very similar to the String method. However, the index 
structure saves on the total amount of space needed to keep for each block. Accordingly, for each 
block bi, the corresponding frequency vector (Def. 2) f{bi) is calculated. Each f(bi) is mapped into 
an integer as follows: Let f{b) = [/i, /2 5 /s? /4]5 then the corresponding identifier for 6^, called 
is defined as 

/6. = i:A-ii^>iii^i-^ 

k=i 

The same identifier translation was used to map each g-gram to an integer. This would decrease 
the number of entries in the index structure exponentially (for q ^ 3) which would be of size g'^L 
Each g-gram/block vector v is mapped into an integer value /^,, where < I^, < |u|I^L However, 
this is not a 1-to-l mapping, therefore only the nonzero entries are kept. The rest of Vector process 
is exactly like String. However, as expected, it is supposed to incur more false positives, on average, 
compared with String, due to the degeneracy of our mapping and loss of the positional information 
during translation. 

We also implemented a third improvement, called Tuple. Each of the blocks are stored as a 
I S I "^-dimensional frequency vector (zero entries might be neglected), where each entry i corresponds 
to the quantity of the unique g-gram g^ in that block. As for the index structure, we would not 
need any space to keep for the g-gram index as in String[4], while the block transformed vectors 
already include its containing g-gram information and the corresponding counters per g-gram is 
implied by each entry. This technique provided exactly the same result as String, however was 
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much more time-efficient. Hence, it is not included in our pruning graphs but is included in our 
timing comparison table(Table 1). 

For a database of N sequences, containing B blocks, String constitutes {N-\- ^)(int) = 0{B){int) 
space(5 » AT), meaning linear in space. However its computational cost is 0{\Q\B). Similarly, 
Vector has the same computational complexity, however, being exponentially more space-efflcient 
(only for q ^ 3) at the cost of more false positives on average. The amount of saved space would 
be approximately equal to 



Tuple is 0(|S|^ • B)-space and 0(B)-tiuie which is linear in total number of blocks. We could as 
well use a tree-based approach[7] and reduce the search time to 0{logB). With regards to all of 
the above methods, we also incorporated different blocking and g-gram partitioning methods: 

• Incremental partitioning: Each of the consecutive g-grams /blocks of length overlapping by 
t — 1 residues, 

• Half Overlap partitioning: Each of the consecutive g-grams /blocks of length ^, overlapping by 
t/2 residues, and 

• non- overlapping partitioning. 

On both of g-gram and block partitioning, more the g-gram/blocks extracted, we observed 
higher computational cost, better filtration ratio, tighter FD bound and a smaller candidate set. 
This choice is a trade-off between cost versus precision. However, due to the limitation of the 
space, we did not include those results in this study. We implemented all the desired algorithms 
and transformations using Java, and ran our simulations on a PIII-800Mhz with 1GB of main 
memory. 

5.2 Considerations 

Following observations should be expected regarding any transformation technique: 

• Frequency Distance (i^D) should be a fair approximation to the original Edit Distance(^I>). 

• More representative coefficients chosen in the frequency domain, then a higher precision on 
the real distance approximation and more filtration efficiency, is to be observed. 
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Figure 3: The resulting candidate answer set as a function of query range on Alu database. 

• Less the FD^ a more compact space is to be observed. This property relies upon the fact that 
a decrease in FD value, would result in sequence vectors being located closer to each other in 
frequency space. We could as well store a trail of quantized vectors as a Minimum Bounding 
Rectangle(MBR), in which case, the number of MBRs needed to store the sequence vectors 
of the data in frequency domain will be minimized, at the cost of less efficient filtration ratio. 

• The calculation of FD should be, computationally, as cheap as possible, and keeping a mini- 
mum number of coefficients, should satisfy achieving a resonably efficient filetration. 

• Filtration Ratio is to be maximized(incurring no false negatives), however the efficiency 
of pruning depends on: i)the structure of sequences, ii) Query sequence, and Hi) Query range. 

5.3 Simulation Results 

Let o, o and denote the use of V\ {V^ + 2^^) and (l'"^ + 2"^ + Last) coefficients of the trans- 
formed sequences, respectively. Figures 3-5, demonstrate the result of running String, Vector and 
application of DFT / DWT transformation techniques for the choice of different coefficients on Alu, 
Mitochondria, and Escherichia coli(E.coli) contig databases[14] for a random query pattern of 
length 16, as the query range varies from 4 to 14. 

Let B denote the total number of blocks in the database. Vertical axis shows the fraction of 
the database that is left for further investigation (those not pruned), that is i^±i^i^ShL}i%^ Qj^^i 
horizontal axis shows the corresponding query range. In Figure 3, as expected Vector gives more 
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Figure 4: The resulting candidate answer set as a function of query range on Mitochondria database. 



Species 


String 


Vector 


Tuple 


DWTl,,o 








Alu 


4.67 


5.09 


5.5 


4.62 


4.5 


5.21 


6.92 


Mitocondria 


1921.7 


3209.9 


176.74 


152.3 


152.6 


175 


236.2 


E. coli 


534.7 


929 


259.59 


221.3 


224.9 


289 


369.9 



Table 1: The timing comparison (in seconds) of running 11 different range queries on the described 
techniques, for three contig datasets[14], for a random query of length 16. 

false positives, and DFTl^^^ demonstrates the best filtration, for (23.37)"-'^ < FRaiu < (0.07)""'^ 
incurring no false negatives for the inspected query range. On the other hand, DWTl^^^ gives 
suspicious false negatives compared with String^ which indicates that it does not have a very good 
performance. We investigate this behavior later in the section. A similar behavior is observed 
in Figure 4, DFTl^^^ gives the best filtration with no false negatives, for (47.13)"^ < FRmuo < 
(17.54)~^, but DWTl^^<^ incurs suspicious false negatives as before. Figure 5, depicts the best 
expectations, no false negatives of any kind on any of the transformations. Again DFTl^^^ gives the 
best estimates with (29.21)"-'^ < FRe.CoU < (0.02)~^. In all the figures, using more coefficients (or 
Li over L2) leads to more efficient filtration, validating our study in Figure 1. 

Table- 1 shows the timing comparison, resulting from running 11 different range queries (4- 14) 
on three real datasets. Compared with String and Vector g-gram methods, transformation always 
works faster (including the offline index construction overhead), and gets even more exphcit as the 
dataset grows, for a 11-13 times faster run, while very closely approximating the String g-gram 
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Figure 5: The resulting candidate answer set as a function of query range on Escherichia coli(E.coli) 
database. 




Figure 6: True Positive Rate(TPR) of DFTl^^^ compared with String g-gram method on Alu^ 
Mitochondria and E.coli datasets. 



method. Figure 6 shows the distribution of True Positive Rate ( TP R) of DFTl^^^ compared with 
String^ on the very same range queries as investigated earher. For instance, joining the results of 
Figures 3-6 and Table-1, for the case of Mitochondria dataset, DFTl^^^ may effectively prune up to 
100 - (17.54)-! c=L 99% of the database, for ^^f- ~ 9% of the total time needed for String g-gram 
method, while incurring no false negatives. However, as we mentioned before, the transformation 
application is unfortunately very much data dependent. Therefore, we ran the experiments on 
a much wider range of environments and inspect the performance improvements. Figures 7-9, 
demonstrate the results of producing 100 random query patterns Q, (8 < \Q\ < 64), and performing 
the range queries for 1 < r < \Q\^ however due to space hmitations, only the results for pattern 
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sizes of length 16 and 64 are shown. For all the datasets and on all the experiments, it can be 
observed that the DFTl^^^ transformation does not produce any false negatives up to the range 
query r < |Q| — e, for e < 3, however due to the blocking method used in String method, more 
false positives are expected [4], meaning that what has been seen as false negative for DFTl^^^ 
might not be false negative after all. So apparently, we may effectively use the transformation 
as a preprocessing phase to prune irrelevant sequences for proximity ranges not very close to the 
query pattern's length(r < |Q| — e). Within this domain, DFTl^^^ performs very well. This 
questionable behavior needs further investigation. For this purpose, we investigated every single 
candidate block produced by String^ DFTl^^^ and DWTl^^^ on a random portion of Alu database, 
for some random query patterns of length 16, and performed a range query of 14, which had caused 
the suspicious behavior on aU the datasets, and inspected the corresponding precision and recall. 
On the inspected configurations. String, DFTl^^^ and DWTl^^^ filtrations, reduced the database 
size to 7.75%, 14.08% and 1.41%, respectively. DWTl-^^^ produced false positives, in addition to a 
couple of false negatives! This would need further investigation on a larger scale. However, DFT^^ ,^ 
caught all the actual fc-distant blocks of the database. Thereafter we inspected the recalls. String 
depicted a better performance which was expected, by producing less false positive compared with 
DFTl^^^. However, DFTl^^^ did not miss any actual A:-distant block, while DWTl^^^ generated 
false negatives and misses the correct result. Both DFTl^^^ and String resulted in a precision of 
1, not missing an}^ correct result, in contrary with DWTl^^^. These results are depicted in Figures 
10-13. The calculation of the distance in String was based on the difference on the number of shared 
g-grams, however, we used the frequency vector difference for DFT and DWT. For this reason, 
none of the sets of false positives produced by DFT and String were subset of one or another. The 
intersection of the results produced by them, consisted of aU the fc-distant blocks in addition to a 
few false positives. 

6 Conclusion 

In this paper, we studied the application of Fourier(DFT) and Haar Wavelet(DWT) transforma^ 
tions on biological sequences and evaluated the specific problem of range query. Transformation 
methods (e.g. DFT), may be applied to prune most of the non-desired sequences, and reduce the 
real search problem to only a fraction of the database, as a preprocessing phase for any of the 
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known heuristic approaches like BLAST[1], PatternHunter[ll], QUASAR[4], FastA[17], and even 
the dynamic programming sequence alignment [18, 15]. Our results show that applying the trans- 
formation technique, a high accuracy and faster database pruning is achieved, when the behavior 
of the studied transformations is taken into account before applying the appropriate range query. 
The filtration ratio is very much data dependent and no generalization on the min/max filtration 
ratio or true positive rates can be suggested. However, the empirical results show a promising per- 
formance behavior, specially on DFTl^^^, incurring no false negatives, high filtration ratio, while 
being considerably faster than the g-gram method. We plan to explore these observations on a 
larger scale in our future work. 
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Figure 7: Random query patterns of various length, and range queries on Alu. 
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Random querioa of length a, on E.C0II 
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Random queries of length 64, on E.coli 
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Figure 8: Random query patterns of various length, and range queries on Escherichia coli( E.coli). 



21 




22 




Figure 10: Resulting candidate percentages of the Alu database for String, DFTl^^^, and DWTl^^^. 
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Figure 11: Resulting candidate block distribution of String, DFTl^^^^, and DWTl^^^^. 




23 




string DFT (L1 ,all) Wavelet (L1 .all) 

Selected methods 



Figure 13: Resulting precisions for String, DFTl^,*, and DWTli,o- 
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